Taxicabs are the only vehicles that have the right to pick up street-hailing and prearranged passengers anywhere in New York City. By law, there are 13,587 taxis in New York City and each taxi must have a medallion affixed to it. The fare estimation majorly depends on the duration of each trip along with other factors. Analyzing these factors would help in estimating the duration better, in turn, estimating the fare better.
We will predict the trip duration of a taxi trip in New York City based on traffic, weather and time based features.
The data is from public data recourse: Kaggle and additional attributes were added from Kaggle.
The data set structure demonstrates that it has 42 features and around 1.5mn records of taxi trips across New York for the year 2016.
The predictor variables are described below:
We will be using the training dataset to train our model and test dataset to predict the trip distance
We will conduct data cleaning and perform an exploratory data analysis to generate some insights influence of the following factors on the trip distance:
Weather conditions
Traffic conditions
Trends over time/Seasonality
We will find plot the correlations and inspect the P values to get statistically significant features which will be used as the predictors
We plan to develop a Linear Regression in order to predict the trip duration
We will be using the RMSE and variance score as a metric to determine the performance of model in terms of prediction
For the New York Taxi Cab problem we are using the data provided in the Kaggle. Additional weather data has been included to account for the impact of weather conditions on trip duration. The weather data is NYC METAR data. A METAR weather report is predominantly used by pilots in fulfillment of a part of a pre-flight weather briefing, and by meteorologists, who use aggregated METAR information to assist in weather forecasting. Our data is aggregated information at hourly level for 2016.
To check the impact of various traffic related conditions we have additional imported OSRM data. OSRM stands for Open Source Routing Machine. It provides us with shortest paths in road networks and offers a multitude of functionality that is potentially useful when implementing maps/location based functionality. Our OSRM features had shortest distance and other road condition related features.
We. create our final data set with 42 features from the above 3 data sources. Our response variable is trip duration. For our analysis we have used the following libraries.
library(dplyr)
library(ggplot2)
library(Hmisc)
library(corrgram)
library(stats)
library(funModeling)
library(caret)
library(lubridate)
Once we have imported our necessary libraries we create our consolidated training data set.
### Importing our training data, weather data and OSRM data ###
train <- read.csv('train.csv')
weather <- read.csv('KNYC_Metars.csv')
osrm <- read.csv("train_augmented.csv")
### Creation of final dataset ###
##Adding an hourly column to the data
train$pickup_hour <- paste(substring(train$pickup_datetime,1,13), ":00:00", sep = "")
## Joining with weather data
final_weather <- left_join(train, weather, by = c("pickup_hour" = "Time") )
##Joining with OSRM data
final_data <- left_join(final_weather, osrm, by = "id")
## Creating a consolidated file for further analysis
write.csv(final_data, file = "Train_consolidated.csv")
### Importing final data set ##
train <- read.csv("C:/Users/soumi/Documents/UCinn Course Material/BANA 6043 - Stats Computing/Final Project/Final Data/Train_consolidated.csv" , header = TRUE)
test <- read.csv("C:/Users/soumi/Documents/UCinn Course Material/BANA 6043 - Stats Computing/Final Project/Final Data/test_consolidate.csv" , header = TRUE)
head(train)
## X id vendor_id pickup_datetime dropoff_datetime passenger_count
## 1 1 id2875421 2 2016-03-14 17:24:55 2016-03-14 17:32:30 1
## 2 2 id2377394 1 2016-06-12 00:43:35 2016-06-12 00:54:38 1
## 3 3 id3858529 2 2016-01-19 11:35:24 2016-01-19 12:10:48 1
## 4 4 id3504673 2 2016-04-06 19:32:31 2016-04-06 19:39:40 1
## 5 5 id2181028 2 2016-03-26 13:30:55 2016-03-26 13:38:10 1
## 6 6 id0801584 2 2016-01-30 22:01:40 2016-01-30 22:09:03 6
## pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude
## 1 -73.98215 40.76794 -73.96463 40.76560
## 2 -73.98042 40.73856 -73.99948 40.73115
## 3 -73.97903 40.76394 -74.00533 40.71009
## 4 -74.01004 40.71997 -74.01227 40.70672
## 5 -73.97305 40.79321 -73.97292 40.78252
## 6 -73.98286 40.74220 -73.99208 40.74918
## store_and_fwd_flag trip_duration pickup_hour Temp. Windchill
## 1 N 455 2016-03-14 17:00:00 4.4 -0.5
## 2 N 663 2016-06-12 00:00:00 28.9 NA
## 3 N 2124 2016-01-19 11:00:00 -6.7 -14.3
## 4 N 429 2016-04-06 19:00:00 7.2 3.3
## 5 N 435 2016-03-26 13:00:00 9.4 NA
## 6 N 443 2016-01-30 22:00:00 3.9 0.2
## Heat.Index Humidity Pressure Dew.Point Visibility Wind.Dir Wind.Speed
## 1 NA 0.86 1017.5 2.2 8.0 ENE 27.8
## 2 29.9 0.53 1006.6 18.3 16.1 West 7.4
## 3 NA 0.46 1016.3 -16.7 16.1 West 24.1
## 4 NA 0.39 1019.1 -6.1 16.1 South 25.9
## 5 NA 0.46 1026.9 -1.7 16.1 Variable 9.3
## 6 NA 0.41 1015.8 -8.3 16.1 SW 16.7
## Gust.Speed Precip Events Conditions distance duration motorway trunk
## 1 57.4 0.3 None Overcast 2009.1 160.9 0 0.00000
## 2 0.0 0.0 None Unknown 2513.4 256.5 0 0.00000
## 3 46.3 0.0 None Clear 9910.7 679.6 0 0.54282
## 4 35.2 0.0 None Clear 1779.1 181.8 0 0.00000
## 5 0.0 0.0 None Clear 1615.0 132.2 0 0.00000
## 6 29.6 0.0 None Clear 1393.6 142.9 0 0.00000
## primary secondary tertiary unclassified residential nTrafficSignals
## 1 0 0.000000 1.0000000 0.000000 0.00000000 14
## 2 0 0.348518 0.1747760 0.000000 0.14390300 25
## 3 0 0.372717 0.0398055 0.000000 0.00686116 38
## 4 0 0.000000 0.4244520 0.000000 0.03974140 18
## 5 0 0.637338 0.3626630 0.000000 0.00000000 17
## 6 0 0.000000 0.2383930 0.537783 0.22382500 12
## nCrossing nStop nIntersection srcCounty dstCounty bug
## 1 5 0 4 1 1 0
## 2 13 0 0 1 1 0
## 3 12 0 3 1 1 0
## 4 6 0 1 1 1 0
## 5 2 0 2 1 1 0
## 6 3 0 0 1 1 0
str(train)
## 'data.frame': 1458644 obs. of 42 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ id : chr "id2875421" "id2377394" "id3858529" "id3504673" ...
## $ vendor_id : int 2 1 2 2 2 2 1 2 1 2 ...
## $ pickup_datetime : chr "2016-03-14 17:24:55" "2016-06-12 00:43:35" "2016-01-19 11:35:24" "2016-04-06 19:32:31" ...
## $ dropoff_datetime : chr "2016-03-14 17:32:30" "2016-06-12 00:54:38" "2016-01-19 12:10:48" "2016-04-06 19:39:40" ...
## $ passenger_count : int 1 1 1 1 1 6 4 1 1 1 ...
## $ pickup_longitude : num -74 -74 -74 -74 -74 ...
## $ pickup_latitude : num 40.8 40.7 40.8 40.7 40.8 ...
## $ dropoff_longitude : num -74 -74 -74 -74 -74 ...
## $ dropoff_latitude : num 40.8 40.7 40.7 40.7 40.8 ...
## $ store_and_fwd_flag: chr "N" "N" "N" "N" ...
## $ trip_duration : int 455 663 2124 429 435 443 341 1551 255 1225 ...
## $ pickup_hour : chr "2016-03-14 17:00:00" "2016-06-12 00:00:00" "2016-01-19 11:00:00" "2016-04-06 19:00:00" ...
## $ Temp. : num 4.4 28.9 -6.7 7.2 9.4 3.9 18.9 13.9 25.6 23.9 ...
## $ Windchill : num -0.5 NA -14.3 3.3 NA 0.2 NA NA NA NA ...
## $ Heat.Index : num NA 29.9 NA NA NA NA NA NA NA NA ...
## $ Humidity : num 0.86 0.53 0.46 0.39 0.46 0.41 0.54 0.55 0.6 0.38 ...
## $ Pressure : num 1018 1007 1016 1019 1027 ...
## $ Dew.Point : num 2.2 18.3 -16.7 -6.1 -1.7 -8.3 9.4 5 17.2 8.9 ...
## $ Visibility : num 8 16.1 16.1 16.1 16.1 16.1 16.1 16.1 12.9 16.1 ...
## $ Wind.Dir : chr "ENE" "West" "West" "South" ...
## $ Wind.Speed : num 27.8 7.4 24.1 25.9 9.3 16.7 7.4 0 9.3 7.4 ...
## $ Gust.Speed : num 57.4 0 46.3 35.2 0 29.6 0 0 0 0 ...
## $ Precip : num 0.3 0 0 0 0 0 0 0 0 0 ...
## $ Events : chr "None" "None" "None" "None" ...
## $ Conditions : chr "Overcast" "Unknown" "Clear" "Clear" ...
## $ distance : num 2009 2513 9911 1779 1615 ...
## $ duration : num 161 256 680 182 132 ...
## $ motorway : num 0 0 0 0 0 0 0 0 0 0 ...
## $ trunk : num 0 0 0.543 0 0 ...
## $ primary : num 0 0 0 0 0 0 0 0 0 0 ...
## $ secondary : num 0 0.349 0.373 0 0.637 ...
## $ tertiary : num 1 0.1748 0.0398 0.4245 0.3627 ...
## $ unclassified : num 0 0 0 0 0 ...
## $ residential : num 0 0.1439 0.00686 0.03974 0 ...
## $ nTrafficSignals : int 14 25 38 18 17 12 19 33 9 58 ...
## $ nCrossing : int 5 13 12 6 2 3 0 18 2 11 ...
## $ nStop : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nIntersection : int 4 0 3 1 2 0 2 23 0 6 ...
## $ srcCounty : int 1 1 1 1 1 1 1 1 1 1 ...
## $ dstCounty : int 1 1 1 1 1 1 1 4 1 1 ...
## $ bug : int 0 0 0 0 0 0 0 0 0 0 ...
summary(train)
## X id vendor_id pickup_datetime
## Min. : 1 Length:1458644 Min. :1.000 Length:1458644
## 1st Qu.: 364662 Class :character 1st Qu.:1.000 Class :character
## Median : 729323 Mode :character Median :2.000 Mode :character
## Mean : 729323 Mean :1.535
## 3rd Qu.:1093983 3rd Qu.:2.000
## Max. :1458644 Max. :2.000
##
## dropoff_datetime passenger_count pickup_longitude pickup_latitude
## Length:1458644 Min. :0.000 Min. :-121.93 Min. :34.36
## Class :character 1st Qu.:1.000 1st Qu.: -73.99 1st Qu.:40.74
## Mode :character Median :1.000 Median : -73.98 Median :40.75
## Mean :1.665 Mean : -73.97 Mean :40.75
## 3rd Qu.:2.000 3rd Qu.: -73.97 3rd Qu.:40.77
## Max. :9.000 Max. : -61.34 Max. :51.88
##
## dropoff_longitude dropoff_latitude store_and_fwd_flag trip_duration
## Min. :-121.93 Min. :32.18 Length:1458644 Min. : 1
## 1st Qu.: -73.99 1st Qu.:40.74 Class :character 1st Qu.: 397
## Median : -73.98 Median :40.75 Mode :character Median : 662
## Mean : -73.97 Mean :40.75 Mean : 959
## 3rd Qu.: -73.96 3rd Qu.:40.77 3rd Qu.: 1075
## Max. : -61.34 Max. :43.92 Max. :3526282
##
## pickup_hour Temp. Windchill Heat.Index
## Length:1458644 Min. :-18.30 Min. :-28.4 Min. :26.6
## Class :character 1st Qu.: 3.90 1st Qu.: -4.5 1st Qu.:27.4
## Mode :character Median : 10.60 Median : -0.5 Median :27.8
## Mean : 11.06 Mean : -1.5 Mean :28.4
## 3rd Qu.: 18.00 3rd Qu.: 2.6 3rd Qu.:29.1
## Max. : 32.20 Max. : 7.1 Max. :33.6
## NA's :12032 NA's :977900 NA's :1412890
## Humidity Pressure Dew.Point Visibility
## Min. :0.090 Min. : 989.5 Min. :-28.300 Min. : 0.40
## 1st Qu.:0.380 1st Qu.:1010.2 1st Qu.: -6.100 1st Qu.:14.50
## Median :0.490 Median :1015.9 Median : 1.100 Median :16.10
## Mean :0.521 Mean :1015.6 Mean : 0.644 Mean :14.69
## 3rd Qu.:0.650 3rd Qu.:1020.9 3rd Qu.: 8.300 3rd Qu.:16.10
## Max. :1.000 Max. :1036.8 Max. : 20.600 Max. :16.10
## NA's :12032 NA's :56031 NA's :12032 NA's :56597
## Wind.Dir Wind.Speed Gust.Speed Precip
## Length:1458644 Min. : 0.000 Min. : 0.00 Min. : 0.000
## Class :character 1st Qu.: 5.600 1st Qu.: 0.00 1st Qu.: 0.000
## Mode :character Median : 9.300 Median : 0.00 Median : 0.000
## Mean : 8.962 Mean : 8.01 Mean : 0.068
## 3rd Qu.: 13.000 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :137.000 Max. :72.20 Max. :11.900
## NA's :12032 NA's :12032 NA's :12032
## Events Conditions distance duration
## Length:1458644 Length:1458644 Min. : 0 Min. : 0.0
## Class :character Class :character 1st Qu.: 1654 1st Qu.: 155.4
## Mode :character Mode :character Median : 2730 Median : 249.2
## Mean : 4569 Mean : 343.5
## 3rd Qu.: 5049 3rd Qu.: 424.0
## Max. :1038310 Max. :46645.3
## NA's :1 NA's :1
## motorway trunk primary secondary
## Min. :0.0000000 Min. :0.000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000000 1st Qu.:0.000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000000 Median :0.000 Median :0.00000 Median :0.1314
## Mean :0.0004421 Mean :0.126 Mean :0.03019 Mean :0.2665
## 3rd Qu.:0.0000000 3rd Qu.:0.000 3rd Qu.:0.00000 3rd Qu.:0.4580
## Max. :1.0000000 Max. :1.000 Max. :1.00000 Max. :1.0000
## NA's :1 NA's :1 NA's :1 NA's :1
## tertiary unclassified residential nTrafficSignals
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. : 0.0
## 1st Qu.:0.08329 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.: 14.0
## Median :0.34729 Median :0.00000 Median :0.00000 Median : 21.0
## Mean :0.41943 Mean :0.01338 Mean :0.04399 Mean : 24.5
## 3rd Qu.:0.75704 3rd Qu.:0.00000 3rd Qu.:0.02493 3rd Qu.: 32.0
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :112.0
## NA's :1 NA's :1 NA's :1 NA's :1
## nCrossing nStop nIntersection srcCounty
## Min. : 0.000 Min. : 0.0000 Min. : 0.00 Min. :1.000
## 1st Qu.: 2.000 1st Qu.: 0.0000 1st Qu.: 0.00 1st Qu.:1.000
## Median : 6.000 Median : 0.0000 Median : 1.00 Median :1.000
## Mean : 8.812 Mean : 0.1153 Mean : 2.52 Mean :1.214
## 3rd Qu.:13.000 3rd Qu.: 0.0000 3rd Qu.: 3.00 3rd Qu.:1.000
## Max. :91.000 Max. :10.0000 Max. :132.00 Max. :5.000
## NA's :1 NA's :1 NA's :1 NA's :1228
## dstCounty bug
## Min. :1.000 Min. :0.000000
## 1st Qu.:1.000 1st Qu.:0.000000
## Median :1.000 Median :0.000000
## Mean :1.271 Mean :0.003143
## 3rd Qu.:1.000 3rd Qu.:0.000000
## Max. :5.000 Max. :3.000000
## NA's :6339
We now check the distribution of our response variable trip duration.
## Distribution of response variable
attach(train)
hist(trip_duration)
The response variable trip duration is given in seconds in the data and we observe there are few outlier values in the data from histogram. We also observe that trip duration is highly skewed to the left.
Now we compare the structure of our test data with our train data to understand is the train data distribution can be compared with the test for accurate prediction of trip duration
train1<- train[, !colnames(train) %in% c("bug")]
test1<- test[, !colnames(test) %in% c("Day")]
train1$set <- "train"
test1$set <- "test"
train_col <- colnames(train1)
test1$trip_duration <- NA
test1$dropoff_datetime <- NA
test2 <- test1[,train_col]
full <- rbind(train1, test1)
obs_day <- full %>%
mutate(
pickup_date = ymd_hms(pickup_datetime),
pickup_date = date(pickup_datetime)
) %>%
group_by(set, pickup_date) %>%
summarise(
obs = n()
)
ggplot(obs_day, aes(x=pickup_date, y=obs, group=set, colour=set)) +
geom_line() +
facet_grid(set~.) +
theme_minimal()
As seen from the plot above, the distribution of train and test data is very similar. For this reason, we have identified the outliers but did not treat them as the treatment would have impacted our model’s prediction on the test data.
From our above summary we observe that there are few NULL values present in our weather variables - Temp., Humidity, Pressure, Dew.Point, Visibility, Wind.Speed, Gust.Speed, Precip, Events, Condition. For the missing values in the continuous variables in the weather data we replace the values with mean of the month and for the categorical variables we replace the missing values with mode.
library(dplyr)
weather <- read.csv("C:/Users/soumi/Documents/UCinn Course Material/BANA 6043 - Stats Computing/Final Project/Final Data/KNYC_Metars.csv")
## Creating a function in R calculate mode for categorical variables ##
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
## Calculating monthly average for weather variables ##
final_weather <- weather %>% mutate(Date_avg = substring(Time,1,10)) %>%
group_by(Date_avg) %>%
summarise(Temp_mean = mean(Temp.), Humidity_mean = mean(Humidity), Press_mean = mean(Pressure),
Dew_mean = mean(Dew.Point), Visibility_mean = mean(Visibility), Winddir_mode = getmode(Wind.Dir),
Windspeed_mean = mean(Wind.Speed), Gustsp_mean = mean(Gust.Speed), Precip_mean = mean(Precip),
Events_mode = getmode(Events), Conditions_mode = getmode(Conditions)) %>%
select(Date_avg, Temp_mean, Humidity_mean, Press_mean, Dew_mean, Visibility_mean, Winddir_mode,
Windspeed_mean, Gustsp_mean, Precip_mean, Events_mode, Conditions_mode)
## Extracting month date in our train data ##
train$Day <- substring(train$pickup_datetime, 1, 10)
#Replacing all NULL weather values in train data with mean and mode values
train_colnames <- colnames(train)
train_natreat <- left_join(train, final_weather, by = c("Day" = "Date_avg"))
train_natreat$Temp.[is.na(train_natreat$Temp.)] <- train_natreat$Temp_mean[is.na(train_natreat$Temp.)]
train_natreat$Humidity[is.na(train_natreat$Humidity)] <- train_natreat$Humidity_mean[is.na(train_natreat$Humidity)]
train_natreat$Pressure[is.na(train_natreat$Pressure)] <- train_natreat$Press_mean[is.na(train_natreat$Pressure)]
train_natreat$Dew.Point[is.na(train_natreat$Dew.Point)] <- train_natreat$Dew_mean[is.na(train_natreat$Dew.Point)]
train_natreat$Visibility[is.na(train_natreat$Visibility)] <- train_natreat$Visibility_mean[is.na(train_natreat$Visibility)]
train_natreat$Wind.Dir[is.na(train_natreat$Wind.Dir)] <- train_natreat$Winddir_mode[is.na(train_natreat$Wind.Dir)]
train_natreat$Wind.Speed[is.na(train_natreat$Wind.Speed)] <- train_natreat$Windspeed_mean[is.na(train_natreat$Wind.Speed)]
train_natreat$Gust.Speed[is.na(train_natreat$Gust.Speed)] <- train_natreat$Gustsp_mean[is.na(train_natreat$Gust.Speed)]
train_natreat$Precip[is.na(train_natreat$Precip)] <- train_natreat$Precip_mean[is.na(train_natreat$Precip)]
train_natreat$Events[is.na(train_natreat$Events)] <- train_natreat$Events_mode[is.na(train_natreat$Events)]
train_natreat$Conditions[is.na(train_natreat$Conditions)] <- train_natreat$Conditions_mode[is.na(train_natreat$Conditions)]
train <- train_natreat %>% select(all_of(train_colnames))
remove(train_natreat)
Finally we also have 1 NULL value in our distance variable that can be observed from the summary statistics provided above. We remove this observation from our final data.
## Removing the row with distance as NULL ##
train <- train %>% filter(is.na(distance) == FALSE)
In our data each trip is identified by a unique ID column. We check if our id has any duplicates in them.
## Duplicate check ##
length(unique(train$id))
## [1] 1458643
From the above check we can conclude our data doesn’t have any duplicate values.
we now check our trip duration variable. As our trip duration variable is skewed we take log transformation.
hist(log(train$trip_duration),
main = "Distribution of log transformation of trip duration")
We check our distance variable for outliers.
## Checking distribution of distance ##
quantile(train$distance, c(.10, .20, .30, .40, .50, .60, .70, .80, .90, 1), na.rm = TRUE)
## 10% 20% 30% 40% 50% 60% 70%
## 1083.20 1472.24 1840.30 2246.70 2729.70 3361.40 4315.70
## 80% 90% 100%
## 6108.40 10451.10 1038310.00
From this we can identify there are outliers at greater than 99 percentile. We don’t remove these outliers as our test data also has these values, if we remove these from our training data the model will not be able to predict.
To understand the different trends in the data we have divided our data in the following sections :
** Trends Observed in the Pickup and Dropoff locations **
We check our pick up and drop off location to understand which areas have the highest pickup and drop off frequency. From the below pickup location distribution (created in tableau).
Checking our locations which have the most frequent pickup and drop off we observe the area around JFK has the highest pickup and drop off frequency.
The pickup and drop off counties have been embedded in the data as follows: NA: Not in NYC
As per this embedding, we observed the density of pickup and drop off locations
The density of drop off locations for Manhattan, Brooklyn and Queens is relatively higher than pickup locations. Manhattan and Brooklyn appear to be the counties where majority pickups happened whereas Bronx has the least amount of pickups. The high frequency of Manhattan can be explained by the dense population in Manhattan as well as most of office located there which translates to higher commuters traffic.
Study of the types of roads
The trips comprise of 7 types of roads provided in the data. We saw the distribution of trips across the types of roads.
library(ggplot2)
roads <- train[c("motorway","trunk","primary","secondary","tertiary","unclassified","residential")]
roads <- roads %>% filter(is.na(motorway) == FALSE)
sumdata=data.frame(value=apply(roads,2,sum))
sumdata$Road_Type=rownames(sumdata)
a <- ggplot(data=sumdata, aes(x=reorder(Road_Type, -value), y=value )) +
geom_bar(stat="identity", fill = "#0EA7A5") +
labs(x = "Road type", y = "Number of trips", title = "Number of trips by road type")
a
As seen in the above graph, majority section of the roads on which the trips are happening are tertiary roads, which are roads that link smaller town or villages, followed by secondary roads. Also, there is barely any volume of trips happening on motorways.This can be corroborated from our above location frequency map which shows most trips are inside the city premise.
Weather conditions impacting trip duration
In our data we have 13 weather related features - Temperature, Humidity, Pressure, Dew Point, Visibility, Wind Speed, Gust Speed, Precipitation , Events and Condition. We check this against our trip duration to see if there are any significant trends that could be observed.
bar_condition <- train%>% mutate(trip_hours = trip_duration/(60*60), trip_min = trip_duration/60) %>% filter(trip_hours < 7, Conditions != c("Unknown")) %>% group_by(Conditions) %>%
summarise(avg_trip_duration_min = mean(trip_min)) %>%
select(Conditions, avg_trip_duration_min) %>%
ggplot(.) +
geom_col(width = .9, fill = "#0EA7A5", mapping = aes(x = reorder(Conditions, -avg_trip_duration_min), y = avg_trip_duration_min)) +
theme (axis.text.x = element_text(size = 6, angle = 45),
axis.text.y = element_text(size = 6, angle = 90)) +
labs(x = "Conditions", y = "Avg Trip duration(min)", title = "Avg trip duration by conditions")
bar_condition
From the above graph we see that we have highest trip duration during rain and haze. This can be attributed to the fact that during such conditions the visibility and the road conditions deteriorates thus increasing trip duration.
# Distribution for temperature
line_temp <- train %>% mutate(trip_hours = trip_duration/(60*60), trip_min = trip_duration/60) %>% filter(trip_hours < 7) %>% group_by(Temp.) %>% summarise(avg_trip_duration_min = mean(trip_min)) %>% select(Temp., avg_trip_duration_min) %>%
ggplot(.) +
geom_line(mapping = aes(x = Temp., y = avg_trip_duration_min, fill = "#0EA7A5")) +
labs(x = "Temperature(celsius)", y = "Avg Trip duration (min)", title = "Trend of avg trip duration by temperature")
## Warning: Ignoring unknown aesthetics: fill
line_temp
From the above temperature chart we notice that people that higher trip duration can be seen when the temperature is 15-20 Celsius. This can be attributed to the fact that more people tend to take cab rides when the weather is in this range.
line_visibility <- train %>% mutate(trip_hours = trip_duration/(60*60), trip_min = trip_duration/60) %>% filter(trip_hours < 7) %>% group_by(Visibility) %>% summarise(avg_trip_duration_min = mean(trip_min)) %>% select(Visibility, avg_trip_duration_min) %>%
ggplot(.) +
geom_line(mapping = aes(x = Visibility, y = avg_trip_duration_min, fill = "#0EA7A5")) +
labs(x = "Visibility", y = "Avg Trip duration (min)", title = "Trend of avg trip duration by Visibility")
## Warning: Ignoring unknown aesthetics: fill
line_visibility
## Warning: Removed 1 row(s) containing missing values (geom_path).
We notice that during low visibility the average trip duration of the cab rides increases. This is expected as due to low visibility, like in conditions haze and rain cabs are forced to drive at lower speed or the traffic conditions deteriorate.
Time based analysis of the No. of Pickups
Exploring the number of pick ups with respect to time, and analyzing which hour of the day has the highest number of pickups.
#Extracting day, month and time in the data by parsing the 'Pickup_datetime' and 'dropoff_datetime' in the data
train$Pickup_Day <- substring(train$pickup_datetime, 1, 10)
train$PickupDay <- format(as.Date(train$Pickup_Day,format="%Y-%m-%d"), format = "%d")
train$Pickupmonth <- format(as.Date(train$Pickup_Day,format="%Y-%m-%d"), format = "%m")
#extracting hour of the day time from pickup_datetime
train$Pickup_time <- format(as.POSIXct(train$pickup_datetime,format="%Y-%m-%d %H:%M:%S"), format = "%H")
#day
train$day <- format(c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday")[as.POSIXlt(train$Pickup_Day)$wday + 1])
#Exploratory data Analysis of day-month and time across the data
#HOUR
#counting the number of pickups by hour
count_of_pickups_wrt_hour <- train %>% count(Pickup_time)
new_count_of_pickups_wrt_hour <- na.omit(count_of_pickups_wrt_hour)
pickup_count <- new_count_of_pickups_wrt_hour[, c("n")]
names_of_hour <- new_count_of_pickups_wrt_hour[, c("Pickup_time")]
hourly_pick_ups <- ggplot(data=new_count_of_pickups_wrt_hour, aes(x=names_of_hour, y=pickup_count )) +
geom_bar(stat="identity", fill = "#0EA7A5")+
labs(x = "Hour of the day", y = "Number of trips", title = "Hourly Pick ups")
hourly_pick_ups
As we can observe from the above chart that the number of trips starts to increase from 8 AM and dips at 2PM, we see there is again a surge of pickups in the evening from 6PM to 10 PM. This behavior can be attributed to office timing and majority of the people hailing cabs are these commuters.
Exploring the number of pick ups with respect to day, and analyzing which day of the week has the highest number of pickups.
#DAY
#counting the number of pickup wrt day
count_of_pickups_wrt_day <- train %>% count(day)
new_count_of_pickups_wrt_day <- na.omit(count_of_pickups_wrt_day)
day_pickup_count <- new_count_of_pickups_wrt_day[, c("n")]
names_of_day <- as.factor(new_count_of_pickups_wrt_day[, c("day")])
names_of_day <- factor(new_count_of_pickups_wrt_day[, c("day")])
daily_pick_ups <- ggplot(data= new_count_of_pickups_wrt_day, aes(x= reorder(names_of_day, -day_pickup_count), y=day_pickup_count )) +
geom_bar(stat="identity", fill = "#0EA7A5") +
labs(x = "Day of the week", y = "Number of trips", title = "Weekly Pick ups")
daily_pick_ups
From the above graph we can see that the highest number of pickups takes place on Friday of the week and we can see that low numbers of pick ups takes place during Monday.
Exploring the number of pick ups with respect to month and analyzing which month of the 2016 year has the highest number of pickups.
#MONTH
#counting the number of pickup wrt month
count_of_pickups_wrt_month <- train %>% count(Pickupmonth)
new_count_of_pickups_wrt_month <- na.omit(count_of_pickups_wrt_month)
month_pickup_count <- new_count_of_pickups_wrt_month[, c("n")]
names_of_month <- new_count_of_pickups_wrt_month[, c("Pickupmonth")]
monthly_pick_ups <- ggplot(data= new_count_of_pickups_wrt_month, aes(x= names_of_month ,y=month_pickup_count)) +
geom_bar(stat="identity", fill = "#0EA7A5") +
labs(x = "Months", y = "Number of trips", title = "Monthly Pickups")
monthly_pick_ups
#ggplot(tips2, aes(x = reorder(day, -perc), y = perc)) + geom_bar(stat = "identity")
In our data we have data from January 2016 to July of 2016. From the above graph, we can observe that highest number of pickups took place during the month of March
Exploration on Trip Duration with respect to hour, day, week and month
Exploring how the trip distribution varies across the data with respect to hour of the day.
#Exploration of average trip_duration in secs for daily, monthly and hourly based on the pick_up_time and drop_off_time across the trainset
#Hourly
train <- train %>% mutate(trip_duration_minutes = trip_duration / 60)
trip_duration_average_wrt_time <- aggregate(train$trip_duration_minutes, list(train$Pickup_time), FUN=mean)
time_1 <- trip_duration_average_wrt_time[,c('Group.1')]
avg_trip_duration <- trip_duration_average_wrt_time[,c('x')]
ggplot(data=trip_duration_average_wrt_time, aes(x=time_1, y=avg_trip_duration, group=1)) +
geom_line()+
geom_point()+
labs(x = "Time of day", y = " Average trip duration min", title = "Average trip duration wrt hours")
From the graph we can see that the average trip duration increases as the day begins i.e from 8 am and we can see a high pick increase in the trip duration in the afternoon and it gradually decreases and then we can see a sudden increase during the night. This can be correlated to higher number of cab rides during these hours which basically are the commuting hours. During this time there are more vehicles on the road thus we can see the average trip duration also increases.
Exploring how the trip distribution varies across the data with respect to day of the week.
#Day
trip_duration_average_wrt_day <- aggregate(train$trip_duration_minutes , list(train$PickupDay), FUN=mean)
average_daily_trip_duration <- trip_duration_average_wrt_day[, c("x")]
each_day <- trip_duration_average_wrt_day[, c("Group.1")]
average_trip_durations_daily <- ggplot(data=trip_duration_average_wrt_day, aes(x=each_day,
y=average_daily_trip_duration )) +
geom_bar(stat="identity", fill = "#0EA7A5")+
labs(x = "Day of month", y = "Average trip duration(min)", title = "Daily average trip durations")
average_trip_durations_daily
Highest average trip duration was recording during the 13th and 5th day of the month
Exploring how the trip distribution varies across the data with respect to week.
#Week
trip_duration_average_wrt_week <- aggregate(train$trip_duration_minutes , list(train$day), FUN=mean)
average_weekly_trip_duration <- trip_duration_average_wrt_week[, c("x")]
day_of_the_week <- trip_duration_average_wrt_week[, c("Group.1")]
average_trip_durations_weekly <- ggplot(data=trip_duration_average_wrt_week, aes(x= reorder(day_of_the_week, -average_weekly_trip_duration), y=average_weekly_trip_duration )) +
geom_bar(stat="identity", fill = "#0EA7A5")+
labs(x = "Day of Week", y = "Average trip duration (min)", title = "Weekly average trip durations")
average_trip_durations_weekly
Thursday was the day when the average trip duration was the highest.
Exploring how the trip distribution varies accross the data with respect to month
#Month
trip_duration_average_wrt_month <- aggregate(train$trip_duration_minutes , list(train$Pickupmonth), FUN=mean)
average_monthly_trip_duration <- trip_duration_average_wrt_month[, c("x")]
average_monthly <- trip_duration_average_wrt_month[, c("Group.1")]
average_trip_durations_monthly <- ggplot(data=trip_duration_average_wrt_month, aes(x= average_monthly, y=average_monthly_trip_duration )) +
geom_bar(stat="identity", fill = "#0EA7A5")+
labs(x = "Month", y = "Average trip duration (min)", title = "Monthly average trip durations")
average_trip_durations_monthly
The average trip duration increased in the month of June. We can see the average trip duration gradually increases over the months this can be attributed to the fact that more people go out during the summer months thus increasing the traffic on the road.
Exploration of trip duration with respect to Traffic Data
In our data we have 4 traffic related features - No. of Signals, No. of stops, No. of Intersection, No. of Crossings. We check this against our trip duration to see if there are any significant trends that could be observed.
Exploring how the trip duration varies with respect to no. of signals
#Exploring ow traffic signals have affected the average trip duration of the uber trips
#No.of signals
traffic_data <- train[, c("nTrafficSignals", "nCrossing", "nStop", "nIntersection", "trip_duration_minutes")]
train <- train %>% mutate(trip_duration_minutes = trip_duration / 60)
trip_duration_average_wrt_nTraffic_signals <- aggregate(traffic_data$trip_duration_minutes, list(traffic_data$nTrafficSignals), FUN=mean)
No_of_traffic_signals <- trip_duration_average_wrt_nTraffic_signals[,c('Group.1')]
avg_trip_duration <- trip_duration_average_wrt_nTraffic_signals[,c('x')]
ggplot(data=trip_duration_average_wrt_nTraffic_signals, aes(x=No_of_traffic_signals, y=avg_trip_duration, group=1)) +
geom_line()+
geom_point()+
labs(x= "Number of traffic Signals", y = "Average trip duration (min)", title = "Average trip duration wrt No. of Traffic Signals")
From the above graph we can see that the average trip duration time increases with no. of traffic signals as we would expect.
Exploring how the trip duration varies with respect to no. of stops
#No. of stops
trip_duration_average_wrt_stops <- aggregate(traffic_data$trip_duration_minutes, list(traffic_data$nStop), FUN=mean)
No_of_stops <- trip_duration_average_wrt_stops[,c('Group.1')]
avg_trip_duration <- trip_duration_average_wrt_stops[,c('x')]
ggplot(data=trip_duration_average_wrt_stops, aes(x=No_of_stops, y=avg_trip_duration, group=1)) +
geom_line()+
geom_point()+
labs(x = "number of Stops", y = "Average trip duration (min)", title = "Average trip duration wrt No. of stops")
From the above graph we notice the that the average trip duration increases initially with increase in stops. We see a dip in after because for multiple stops to occur the distance increase and this may not uniformly impact the trip duration.
Exploring how the trip duration varies with respect to no. of intersections
#NIntersections
trip_duration_average_wrt_intersections <- aggregate(traffic_data$trip_duration_minutes, list(traffic_data$nIntersection), FUN=mean)
No_of_Intersections <- trip_duration_average_wrt_intersections[,c('Group.1')]
avg_trip_duration <- trip_duration_average_wrt_intersections[,c('x')]
ggplot(data=trip_duration_average_wrt_intersections, aes(x=No_of_Intersections, y=avg_trip_duration, group=1)) +
geom_line()+
geom_point()+
labs(x = "Number of intersections", y = " Average trip duration (min)", title = "Average trip duration wrt No. of Intersections")
From the above graph we notice the that the average trip duration increases initially with increase in intersections as cabs have to stop at intersections which would contribute to increased trip duration. We see a dip in after because for multiple intersections to occur the distance increase and this may not uniformly impact the trip duration.
Exploring how the trip duration varies with respect to no. of pedestrian crossings
#Ncrossings
trip_duration_average_wrt_no_of_peds <- aggregate(traffic_data$trip_duration_minutes, list(traffic_data$nCrossing), FUN=mean)
No_of_peds <- trip_duration_average_wrt_no_of_peds[,c('Group.1')]
avg_trip_duration <- trip_duration_average_wrt_no_of_peds[,c('x')]
ggplot(data=trip_duration_average_wrt_no_of_peds, aes(x=No_of_peds, y=avg_trip_duration, group=1)) +
geom_line()+
geom_point()+
labs(x="number of pedestrian crossing", y = "Average trip duration (min)", title = "Average trip duration wrt No. of Pedestrians crossings")
The average trip duration was observed to have a steady increase with increase in No. of Pedestrians crossings and there is observed to be a peak value when the no. of pedestrian crossings reaches a certain number.
To understand the impact of our variables on our response variable trip duration view the correlation matrix for the variable. We have divided our predictor variables in 3 categories:
Weather variables
We have 13 weather related features in our data. Among our 13 features we have 2 categorical features and 11 continuous features. We check the correlation with our 11 continuous features. Among the 11 features we have removed Windchill and Heat Index as these columns have high numbers of NULL which we are not able to impute
#Weather
library(corrgram)
weather_cont %>% corrgram()
Traffic Variables
Evaluating the correlation between traffic variables.
#Traffic data
train <- train %>% mutate(trip_duration_minutes = trip_duration / 60)
traffic_data <- train[, c("nTrafficSignals", "nCrossing", "nStop", "nIntersection", "trip_duration_minutes")]
# checking for any missing values in the above variables of the data
sapply(traffic_data, function(x) sum(is.na(x)))
#replacing NA values with zeros
traffic_data[is.na(traffic_data)] <- 0
#checking the NA values in the data frame again
sapply(traffic_data, function(x) sum(is.na(x)))
#calculating the correlation graph
corrgram(traffic_data)
From the above graph we can observe that there has is a high positive correlation between No. of traffic signals and No. of Pedestrians crossings. There appears to be a slight strong correlation between No. of Traffic signals and No. of Intersections as well. We can observe that there is a weak correlation between rest of the variables.
Road Conditions
We are now going to observe the correlation between variables so as to understand if any of these variables should be excluded from our model.
corrgram::corrgram(clean_road_data)
As illustrated above, correlation is fairly weak across the board. The most salient being a coefficient of -0.54 between secondary and tertiary roads. In practical terms this makes sense as trip routes taken will likely either be by using tertiary or secondary roads. From this analyses we can conclude all road-type variables should be considered for modeling.
Consolidated correlation analysis
We studies the correlation between all our predictor variables in order to eliminate any highly correlated variables. The variables with correlation value > 0.7 are eliminated.
library(Hmisc)
Cor_cont <- train %>% select(passenger_count, Temp.,Windchill,Heat.Index,Humidity,Pressure,Dew.Point,Visibility,Wind.Speed,Gust.Speed,Precip,distance,nTrafficSignals,nCrossing,nStop,nIntersection)
Cor_all <- rcorr(as.matrix(Cor_cont))
## Warning in sqrt(npair - 2): NaNs produced
Cor_all
## passenger_count Temp. Windchill Heat.Index Humidity Pressure
## passenger_count 1.00 0.00 0.00 0.01 0.00 0.00
## Temp. 0.00 1.00 0.98 0.91 0.00 -0.22
## Windchill 0.00 0.98 1.00 NA 0.24 -0.20
## Heat.Index 0.01 0.91 NA 1.00 0.13 0.04
## Humidity 0.00 0.00 0.24 0.13 1.00 -0.34
## Pressure 0.00 -0.22 -0.20 0.04 -0.34 1.00
## Dew.Point 0.00 0.84 0.74 0.59 0.53 -0.37
## Visibility 0.00 -0.09 -0.06 -0.35 -0.61 0.16
## Wind.Speed 0.00 -0.27 -0.29 0.03 -0.03 -0.10
## Gust.Speed 0.00 -0.28 -0.33 0.14 -0.08 -0.17
## Precip 0.00 -0.02 0.05 NaN 0.28 -0.11
## distance 0.01 0.01 0.01 0.00 0.01 0.00
## nTrafficSignals 0.01 0.01 0.02 0.00 0.00 0.00
## nCrossing 0.01 0.00 0.01 0.01 0.00 0.00
## nStop 0.00 0.00 0.00 0.01 0.01 0.00
## nIntersection 0.00 0.00 0.01 0.00 0.01 0.00
## Dew.Point Visibility Wind.Speed Gust.Speed Precip distance
## passenger_count 0.00 0.00 0.00 0.00 0.00 0.01
## Temp. 0.84 -0.09 -0.27 -0.28 -0.02 0.01
## Windchill 0.74 -0.06 -0.29 -0.33 0.05 0.01
## Heat.Index 0.59 -0.35 0.03 0.14 NaN 0.00
## Humidity 0.53 -0.61 -0.03 -0.08 0.28 0.01
## Pressure -0.37 0.16 -0.10 -0.17 -0.11 0.00
## Dew.Point 1.00 -0.36 -0.25 -0.29 0.11 0.02
## Visibility -0.36 1.00 -0.04 0.02 -0.43 0.01
## Wind.Speed -0.25 -0.04 1.00 0.60 0.07 -0.01
## Gust.Speed -0.29 0.02 0.60 1.00 0.05 -0.01
## Precip 0.11 -0.43 0.07 0.05 1.00 0.00
## distance 0.02 0.01 -0.01 -0.01 0.00 1.00
## nTrafficSignals 0.01 0.02 -0.01 -0.01 -0.01 0.23
## nCrossing 0.01 0.01 -0.01 -0.01 0.00 0.10
## nStop 0.00 0.00 0.00 0.00 0.00 0.07
## nIntersection 0.01 0.01 -0.01 -0.01 0.00 0.44
## nTrafficSignals nCrossing nStop nIntersection
## passenger_count 0.01 0.01 0.00 0.00
## Temp. 0.01 0.00 0.00 0.00
## Windchill 0.02 0.01 0.00 0.01
## Heat.Index 0.00 0.01 0.01 0.00
## Humidity 0.00 0.00 0.01 0.01
## Pressure 0.00 0.00 0.00 0.00
## Dew.Point 0.01 0.01 0.00 0.01
## Visibility 0.02 0.01 0.00 0.01
## Wind.Speed -0.01 -0.01 0.00 -0.01
## Gust.Speed -0.01 -0.01 0.00 -0.01
## Precip -0.01 0.00 0.00 0.00
## distance 0.23 0.10 0.07 0.44
## nTrafficSignals 1.00 0.56 0.08 0.16
## nCrossing 0.56 1.00 0.14 0.03
## nStop 0.08 0.14 1.00 0.03
## nIntersection 0.16 0.03 0.03 1.00
##
## n
## passenger_count Temp. Windchill Heat.Index Humidity Pressure
## passenger_count 1458643 1458643 480744 45754 1458643 1410341
## Temp. 1458643 1458643 480744 45754 1458643 1410341
## Windchill 480744 480744 480744 0 480744 480744
## Heat.Index 45754 45754 0 45754 45754 45754
## Humidity 1458643 1458643 480744 45754 1458643 1410341
## Pressure 1410341 1410341 480744 45754 1410341 1410341
## Dew.Point 1458643 1458643 480744 45754 1458643 1410341
## Visibility 1409775 1409775 480744 45754 1409775 1409775
## Wind.Speed 1458643 1458643 480744 45754 1458643 1410341
## Gust.Speed 1458643 1458643 480744 45754 1458643 1410341
## Precip 1458643 1458643 480744 45754 1458643 1410341
## distance 1458643 1458643 480744 45754 1458643 1410341
## nTrafficSignals 1458643 1458643 480744 45754 1458643 1410341
## nCrossing 1458643 1458643 480744 45754 1458643 1410341
## nStop 1458643 1458643 480744 45754 1458643 1410341
## nIntersection 1458643 1458643 480744 45754 1458643 1410341
## Dew.Point Visibility Wind.Speed Gust.Speed Precip distance
## passenger_count 1458643 1409775 1458643 1458643 1458643 1458643
## Temp. 1458643 1409775 1458643 1458643 1458643 1458643
## Windchill 480744 480744 480744 480744 480744 480744
## Heat.Index 45754 45754 45754 45754 45754 45754
## Humidity 1458643 1409775 1458643 1458643 1458643 1458643
## Pressure 1410341 1409775 1410341 1410341 1410341 1410341
## Dew.Point 1458643 1409775 1458643 1458643 1458643 1458643
## Visibility 1409775 1409775 1409775 1409775 1409775 1409775
## Wind.Speed 1458643 1409775 1458643 1458643 1458643 1458643
## Gust.Speed 1458643 1409775 1458643 1458643 1458643 1458643
## Precip 1458643 1409775 1458643 1458643 1458643 1458643
## distance 1458643 1409775 1458643 1458643 1458643 1458643
## nTrafficSignals 1458643 1409775 1458643 1458643 1458643 1458643
## nCrossing 1458643 1409775 1458643 1458643 1458643 1458643
## nStop 1458643 1409775 1458643 1458643 1458643 1458643
## nIntersection 1458643 1409775 1458643 1458643 1458643 1458643
## nTrafficSignals nCrossing nStop nIntersection
## passenger_count 1458643 1458643 1458643 1458643
## Temp. 1458643 1458643 1458643 1458643
## Windchill 480744 480744 480744 480744
## Heat.Index 45754 45754 45754 45754
## Humidity 1458643 1458643 1458643 1458643
## Pressure 1410341 1410341 1410341 1410341
## Dew.Point 1458643 1458643 1458643 1458643
## Visibility 1409775 1409775 1409775 1409775
## Wind.Speed 1458643 1458643 1458643 1458643
## Gust.Speed 1458643 1458643 1458643 1458643
## Precip 1458643 1458643 1458643 1458643
## distance 1458643 1458643 1458643 1458643
## nTrafficSignals 1458643 1458643 1458643 1458643
## nCrossing 1458643 1458643 1458643 1458643
## nStop 1458643 1458643 1458643 1458643
## nIntersection 1458643 1458643 1458643 1458643
##
## P
## passenger_count Temp. Windchill Heat.Index Humidity Pressure
## passenger_count 0.0876 0.0319 0.0016 0.0000 0.0000
## Temp. 0.0876 0.0000 0.0000 0.0000 0.0000
## Windchill 0.0319 0.0000 0.0000 0.0000
## Heat.Index 0.0016 0.0000 0.0000 0.0000
## Humidity 0.0000 0.0000 0.0000 0.0000 0.0000
## Pressure 0.0000 0.0000 0.0000 0.0000 0.0000
## Dew.Point 0.1805 0.0000 0.0000 0.0000 0.0000 0.0000
## Visibility 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## Wind.Speed 0.2683 0.0000 0.0000 0.0000 0.0000 0.0000
## Gust.Speed 0.0544 0.0000 0.0000 0.0000 0.0000 0.0000
## Precip 0.6320 0.0000 0.0000 0.0000 0.0000
## distance 0.0000 0.0000 0.0000 0.8567 0.0000 0.0463
## nTrafficSignals 0.0000 0.0000 0.0000 0.6446 0.0000 0.1450
## nCrossing 0.0000 0.0000 0.0000 0.0051 0.5581 0.0199
## nStop 0.0068 0.2629 0.6635 0.2144 0.0000 0.0139
## nIntersection 0.0030 0.1234 0.0000 0.4869 0.0000 0.0001
## Dew.Point Visibility Wind.Speed Gust.Speed Precip distance
## passenger_count 0.1805 0.0000 0.2683 0.0544 0.6320 0.0000
## Temp. 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## Windchill 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## Heat.Index 0.0000 0.0000 0.0000 0.0000 0.8567
## Humidity 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## Pressure 0.0000 0.0000 0.0000 0.0000 0.0000 0.0463
## Dew.Point 0.0000 0.0000 0.0000 0.0000 0.0000
## Visibility 0.0000 0.0000 0.0000 0.0000 0.0000
## Wind.Speed 0.0000 0.0000 0.0000 0.0000 0.0000
## Gust.Speed 0.0000 0.0000 0.0000 0.0000 0.0000
## Precip 0.0000 0.0000 0.0000 0.0000 0.0000
## distance 0.0000 0.0000 0.0000 0.0000 0.0000
## nTrafficSignals 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## nCrossing 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## nStop 0.0000 0.0394 0.0028 0.0000 0.6320 0.0000
## nIntersection 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## nTrafficSignals nCrossing nStop nIntersection
## passenger_count 0.0000 0.0000 0.0068 0.0030
## Temp. 0.0000 0.0000 0.2629 0.1234
## Windchill 0.0000 0.0000 0.6635 0.0000
## Heat.Index 0.6446 0.0051 0.2144 0.4869
## Humidity 0.0000 0.5581 0.0000 0.0000
## Pressure 0.1450 0.0199 0.0139 0.0001
## Dew.Point 0.0000 0.0000 0.0000 0.0000
## Visibility 0.0000 0.0000 0.0394 0.0000
## Wind.Speed 0.0000 0.0000 0.0028 0.0000
## Gust.Speed 0.0000 0.0000 0.0000 0.0000
## Precip 0.0000 0.0000 0.6320 0.0000
## distance 0.0000 0.0000 0.0000 0.0000
## nTrafficSignals 0.0000 0.0000 0.0000
## nCrossing 0.0000 0.0000 0.0000
## nStop 0.0000 0.0000 0.0000
## nIntersection 0.0000 0.0000 0.0000
Based on the correlation coefficients obtained above, we have removed the following features from further analysis as they were highly correlated (coefficient >0.7): - Windchill(0.97 with Temp) - Heat Index (0.90 with Temp) - Dew Point(0.83 with Temp)
Based on our above correlation analysis we select our final features for trip duration prediction. Before we are able to feed the selected features in our model we will perform few pre processing to encode our categorical variables. We use One hot encoding to convert our categorical variable for our model. In One hot encoding each category in a variable is transformed into a column with binary value of 0 and 1.
We perform all these pre-processing steps on both our train and test data set to prepare for model input and prediction.
## Final which we require for our model to train##
train_model <- train %>% mutate(pickup_hour = format(as.POSIXct(train$pickup_datetime,format="%Y-%m-%d %H:%M:%S"),"%H"),
pickup_date = as.POSIXct(substring(train$pickup_datetime,1,10),format="%Y-%m-%d"),
pickup_month = month(as.POSIXct(substring(train$pickup_datetime,1,10),format="%Y-%m-%d")),
pickup_dayofweek = weekdays(as.POSIXct(train$pickup_datetime,format="%Y-%m-%d %H:%M:%S")),
pickup_Day = format(train$pickup_date, format = "%d"),
pickup_latitude3 = round(train$pickup_latitude, digits = 3),
pickup_longitude3 = round(train$pickup_longitude, digits = 3),
dropoff_latitude3 = round(train$dropoff_latitude, digits = 3),
dropoff_longitude3 = round(train$dropoff_longitude, digits = 3)
)
## Adding fields which we require for our model to test
test_model <- test %>% mutate(pickup_hour = format(as.POSIXct(test$pickup_datetime,format="%Y-%m-%d %H:%M:%S"),"%H"),
pickup_date = as.POSIXct(substring(test$pickup_datetime,1,10),format="%Y-%m-%d"),
pickup_month = month(as.POSIXct(substring(test$pickup_datetime,1,10),format="%Y-%m-%d")),
pickup_dayofweek = weekdays(as.POSIXct(test$pickup_datetime,format="%Y-%m-%d %H:%M:%S")),
pickup_Day = format(test$pickup_date, format = "%d"),
pickup_latitude3 = round(test$pickup_latitude, digits = 3),
pickup_longitude3 = round(test$pickup_longitude, digits = 3),
dropoff_latitude3 = round(test$dropoff_latitude, digits = 3),
dropoff_longitude3 = round(test$dropoff_longitude, digits = 3)
)
write.csv(train_model, file = "train_final_model.csv")
write.csv(test_model, file = "test_final_model.csv")
Now we have added our additional features to the data we can move to encoding the categorical variables and creating the final data for model.
## Adding fields which we require for our model to train##
train_model <- read.csv("C:/Users/soumi/Documents/UCinn Course Material/BANA 6043 - Stats Computing/Final Project/Final Data/Final submission/train_final_model.csv")
#Changing to factors for encoding
train_model$vendor_id <- factor(train_model$vendor_id, exclude = NULL)
train_model$store_and_fwd_flag <- factor(train_model$store_and_fwd_flag, exclude = NULL)
train_model$Conditions <- factor(train_model$Conditions, exclude = NULL)
train_model$pickup_hour <- factor(train_model$pickup_hour, exclude = NULL)
train_model$pickup_dayofweek <- factor(train_model$pickup_dayofweek, exclude = NULL)
train_model$pickup_month <- factor(train_model$pickup_month, exclude = NULL)
train_model$pickup_Day <- factor(train_model$pickup_Day, exclude = NULL)
train_model$dstCounty <- factor(train_model$dstCounty, exclude = NULL)
train_model$srcCounty <- factor(train_model$dstCounty, exclude = NULL)
## Adding fields which we require for our model to test
test_model <- read.csv("C:/Users/soumi/Documents/UCinn Course Material/BANA 6043 - Stats Computing/Final Project/Final Data/Final submission/test_final_model.csv")
#Changing to factor for encoding
test_model$vendor_id <- factor(test_model$vendor_id, exclude = NULL)
test_model$store_and_fwd_flag <- factor(test_model$store_and_fwd_flag, exclude = NULL)
test_model$Conditions <- factor(test_model$Conditions, exclude = NULL)
test_model$pickup_hour <- factor(test_model$pickup_hour, exclude = NULL)
test_model$pickup_dayofweek <- factor(test_model$pickup_dayofweek, exclude = NULL)
test_model$pickup_month <- factor(test_model$pickup_month, exclude = NULL)
test_model$pickup_Day <- factor(test_model$pickup_Day, exclude = NULL)
test_model$dstCounty <- factor(test_model$dstCounty, exclude = NULL)
test_model$srcCounty <- factor(test_model$dstCounty, exclude = NULL)
#### Encoding categorical variables in train
en_train <- train_model
## Changing NA values to factor for encoding
en_train$dstCounty <- addNA(en_train$dstCounty)
en_train$srcCounty <- addNA(en_train$srcCounty)
## One hot encoding
dv_train <- caret::dummyVars(" ~ vendor_id + store_and_fwd_flag + Conditions + pickup_hour + pickup_dayofweek + pickup_month + pickup_Day + dstCounty + srcCounty",
data = en_train)
en_train <- data.frame(predict(dv_train, newdata = train_model))
## Dropping NA values for pickup_hour, pickup_dayofweek and pickup_Day
en_train <- en_train[,!colnames(en_train) %in% c("pickup_hour.NA", "pickup_dayofweek.NA", "pickup_Day.NA")]
#### Creating final dataset for the model training
## Response variable
Y <- train_model$log_trip_duration
##Predictor variables
cont_cols <- c("passenger_count", "Temp.", "Humidity", "Wind.Speed", "Gust.Speed", "Precip", "distance",
"motorway", "trunk", "primary", "secondary", "tertiary", "unclassified", "residential", "nTrafficSignals","nCrossing", "nStop", "nIntersection")
X_cont <- train_model[,cont_cols]
x_dis <- en_train
X <- cbind(X_cont,x_dis)
train_final <- cbind(Y, X)
## Encoding categorical variables in test
en_test <- test_model
# Changing NA values to factor levels
en_test$dstCounty <- addNA(en_test$dstCounty)
en_test$srcCounty <- addNA(en_test$srcCounty)
#One hot encoding
dv_test <- caret::dummyVars(" ~ vendor_id + store_and_fwd_flag + Conditions + pickup_hour + pickup_dayofweek + pickup_month + pickup_Day + dstCounty + srcCounty",
data = en_test)
en_test <- data.frame(predict(dv_test, newdata = test_model))
## Creating final dataset for prediction
cont_cols_pred <- c("passenger_count", "Temp.", "Humidity", "Wind.Speed", "Gust.Speed", "Precip", "distance",
"motorway", "trunk", "primary", "secondary", "tertiary", "unclassified", "residential", "nTrafficSignals",
"nCrossing", "nStop", "nIntersection")
X_cont_pred <- test_model[,cont_cols]
x_dis_pred <- en_test
X_pred <- cbind(X_cont_pred,x_dis_pred)
After our feature engineering steps we now have 118 feature for our model. Before inputing our train data set to our model we split our data in 70:30 ratio to assess our model performance.
### Split train to 70/30 to assess model performance
set.seed(123)
train_split <- sample(nrow(train_final), 0.70*nrow(train_final))
model_train <- train_final[train_split,]
model_test <- train_final[-train_split,]
To predict the taxi trip duration in our test data set we have selected linear regression algorithm. Linear regression is a linear model, e.g. a model that assumes a linear relationship between the input variables (x) and the single output variable (y). More specifically, that y can be calculated from a linear combination of the input variables (x).
In our case our input variable x is our 118 features that we created in the above steps and our output y variable is log transformation of trip duration. Our model is given by.
lm_model1 <- lm(Y ~ ., data = model_train)
Now to test the performance of our model we check our out of sample performance of 30% data split we created previously from our training dataset.
OOS_lm_model1 <- predict(lm_model1, model_test)
## Warning in predict.lm(lm_model1, model_test): prediction from a rank-deficient
## fit may be misleading
### Defining our model assessment function
calc_performance <- function(actual, pred) {
rmse <- sqrt(mean((actual - pred)**2))
mae <- mean(abs(actual - pred))
mape <- mean(abs((actual-pred)/actual))
retvals <- list(rmse = rmse, mae = mae, mape = mape)
return(retvals)
}
### Our model performance metrics
OOS_lm <- calc_performance(model_test$Y, OOS_lm_model1)
OOS_lm
## $rmse
## [1] 0.5231223
##
## $mae
## [1] 0.3588577
##
## $mape
## [1] Inf
Our model gives a RMSE of 0.523.
Now we run our model on our test data set to predict the trip duration.
model_output <- predict(lm_model1, X_pred)
## taking inverse log transformation to get actual trip duration
output <- exp(model_output)
final_result <- cbind(test$id, output)
write.csv(final_result, file = "C:/Users/soumi/Documents/UCinn Course Material/BANA 6043 - Stats Computing/Final Project/Final Data/Final submission/final_prediction.csv")
Our final linear regression model is able to predict our trip duration values in our test data. From our detailed exploration of the different factors - weather, traffic, road conditions and trip characteristics we were able to identify trends in the data which seem to impact a cab trip duration.
The analysis shows that people are on the move a lot in New York (by taxi). The “quietest” times in New York City are in the wee hours of the morning on the weekends. Another interesting observation from our data is that majority of our pickups and drop off happens near JFK.
In this section we will cover the new functions we have applied to complete our Analysis
To generate an exhaustive set of correlation matrix and their significance we have use the rcorr function from the package Hmisc. For our training data set we required to understand the features which are highly correlated to each other before putting them in our model as that would result in fluctuation of our model result.
rcorr computes a matrix of Pearson’s r or Spearman’s rho rank correlation coefficients for all possible pairs of columns of a matrix. Missing values are deleted in pairs rather than deleting all rows of x having any missing variables. Ranks are computed using efficient algorithms, using midranks for ties.
Our model had final 9 categorical variable which showed impact on our response variable trip duration. As linear regression model can’t directly take categorical variables as is we had to encode these in a format which the model can understand. One hot encoding is a process by which categorical variables are converted into a form that could be provided to ML algorithms to do a better job in prediction.This is where the integer encoded variable is removed and a new binary variable is added for each unique integer value.
We have used the function dummyVars from the caret package to encode our categorical variables. This function produces a dummy variables data set. We utilized it to break the categorical variables into columns with values of either 0 or 1, whereas if a specific category was present for the ride ID it would be marked as 1, the rest are marked 0 otherwise.
In our train data we had 2 date fields - pickup date time and dropoff date time, from which needed to extract month, day, day of week and hours to understand the different trends observed in trip duration with time.
In our datasource the date time field were in character format. To convert these in readable datetime format we used As.POSIXct from the library lubridate. I addition to this we also used format function from base R package to extract the relevant information for our time-based analysis.
We used the Profiling_num function from library funModeling to get a glimpse of statistical indicators for all numerical variables. We used it to get a quick digest list of mean, median, variability, quantiles, range, etc. for roads data.
We used the sapply function from the base R package to convert into matrix. sapply converts from data frames, vectors or lists to a matrix or vector. We took the traffic data vector to find null values and output a matrix with columns indicating how many nulls were present.